
USHIFT-6401, USHIFT-6788: Add fail-fast RBAC bootstrap hook deadlock detection#6471

Open
copejon wants to merge 4 commits into openshift:main from copejon:fix-USHIFT-6401

Conversation

@copejon
Contributor

@copejon copejon commented Apr 7, 2026

Summary

  • Add parallel RBAC bootstrap hook deadlock detector to kube-apiserver
  • Detect deadlock condition in ~15 seconds instead of waiting 60 seconds
  • Restart microshift-etcd.scope to recover from deadlock
  • USHIFT-6788: Add wall-clock deadline (rbacHookMaxWaitDuration = 30s) to prevent etcd health flapping from extending detection indefinitely

Problem

MicroShift enters a permanent crash loop after first restart due to the RBAC bootstrap hook deadlocking when etcd contains existing data. The rbac/bootstrap-roles PostStartHook uses context.TODO() for API calls with no timeout, causing a circular dependency where the hook waits for API server readiness while the API server waits for the hook to complete.

Related upstream issues: kubernetes/kubernetes#86715, kubernetes/kubernetes#97119
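The circular wait is easy to reproduce in miniature. Below is a self-contained Go sketch with hypothetical names (not the vendored hook code): the "hook" goroutine blocks until readiness is signaled, while readiness is signaled only once the hook finishes, so without a timeout neither side can progress.

package main

import (
	"fmt"
	"time"
)

func main() {
	hookDone := make(chan struct{})    // closed when the post-start hook returns
	serverReady := make(chan struct{}) // closed when /readyz would report ok

	// The "hook": blocks on readiness, standing in for the untimed
	// context.TODO() API calls described above.
	go func() {
		<-serverReady
		close(hookDone)
	}()

	// The "server": reports ready only after all post-start hooks complete.
	go func() {
		<-hookDone
		close(serverReady)
	}()

	select {
	case <-serverReady:
		fmt.Println("started")
	case <-time.After(2 * time.Second):
		fmt.Println("deadlock: the hook and readiness are waiting on each other")
	}
}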

Solution

Add a fail-fast deadlock detector (see the sketch after this list) that:

  1. Monitors /readyz/poststarthook/rbac/bootstrap-roles specifically
  2. Checks if etcd is healthy while the hook is stuck
  3. If etcd is healthy but the hook has not completed for 15 seconds, declares a deadlock
  4. Enforces absolute 30s wall-clock deadline regardless of etcd state changes (prevents flapping from extending detection)
  5. Restarts microshift-etcd.scope to force clean recovery
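
A minimal Go sketch of that detector loop follows, using constant values and helper names assumed from the review discussion below; the authoritative implementation lives in pkg/controllers/kube-apiserver.go.

package controllers

import (
	"context"
	"time"

	"k8s.io/client-go/rest"
)

const (
	rbacHookCheckInterval   = 2 * time.Second  // probe cadence (assumed value)
	rbacHookMaxWaitDuration = 30 * time.Second // absolute wall-clock cap
	maxChecks               = 7                // ~15s of confirmed stall (assumed value)
)

// detectRBACHookDeadlock signals on deadlockDetected only after the
// "etcd healthy + hook stuck" predicate is confirmed maxChecks times.
func detectRBACHookDeadlock(ctx context.Context, restClient rest.Interface,
	etcdHealthy func(context.Context) (bool, error), deadlockDetected chan<- struct{}) {
	startTime := time.Now()
	checkCount := 0

	for checkCount < maxChecks {
		// Absolute deadline: never reset by etcd state changes.
		if time.Since(startTime) >= rbacHookMaxWaitDuration {
			return
		}
		select {
		case <-ctx.Done():
			return
		case <-time.After(rbacHookCheckInterval):
		}

		// Probe the hook with a short per-request timeout so a hung API
		// server cannot block the detector itself.
		probeCtx, cancel := context.WithTimeout(ctx, time.Second)
		var status int
		err := restClient.Get().
			AbsPath("/readyz/poststarthook/rbac/bootstrap-roles").
			Do(probeCtx).StatusCode(&status).Error()
		cancel()

		if err == nil && status == 200 {
			return // hook completed; no deadlock
		}
		if err != nil && status == 0 {
			continue // transport error only; don't count toward the deadlock
		}

		// Count only a confirmed "hook stuck while etcd is healthy" observation.
		if healthy, herr := etcdHealthy(ctx); herr == nil && healthy {
			checkCount++
		} else {
			checkCount = 0 // flapping resets the count, not the wall clock
		}
	}
	close(deadlockDetected)
}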

Test plan

  • Start MicroShift fresh - verify normal startup
  • Stop and restart MicroShift - verify no crash loop
  • If deadlock occurs, verify detection in ~15 seconds
  • Verify "RBAC bootstrap hook deadlock detected" log message on recovery
  • Verify wall-clock deadline prevents indefinite flapping extension
  • Verify make verify-go passes
  • Verify make test-unit passes

🤖 Generated with Claude Code via /jira:solve [USHIFT-6401](https://redhat.atlassian.net/browse/USHIFT-6401), [USHIFT-6788](https://redhat.atlassian.net/browse/USHIFT-6788)

Summary by CodeRabbit

  • Bug Fixes
    • Improved startup reliability by detecting RBAC bootstrap deadlocks via time-bounded monitoring of the readiness endpoint and etcd health; the detector confirms the condition before acting, reducing false positives, and performs a bounded service-restart attempt to recover initialization.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) on Apr 7, 2026
@openshift-ci-robot

openshift-ci-robot commented Apr 7, 2026

@copejon: This pull request references USHIFT-6401 which is a valid jira issue.

Details

In response to this:

Summary

  • Add parallel RBAC bootstrap hook deadlock detector to kube-apiserver
  • Detect deadlock condition in ~15 seconds instead of waiting 60 seconds
  • Restart microshift-etcd.scope to recover from deadlock
  • Pure MicroShift-level fix without vendored code changes

Problem

MicroShift enters a permanent crash loop after first restart due to the RBAC bootstrap hook deadlocking when etcd contains existing data. The rbac/bootstrap-roles PostStartHook uses context.TODO() for API calls with no timeout, causing a circular dependency where the hook waits for API server readiness while the API server waits for the hook to complete.

Related upstream issues: kubernetes/kubernetes#86715, kubernetes/kubernetes#97119

Solution

Add a fail-fast deadlock detector that:

  1. Monitors /readyz/poststarthook/rbac/bootstrap-roles specifically
  2. Checks if etcd is healthy while the hook is stuck
  3. If etcd is healthy but hook not completing for 15 seconds, declares deadlock
  4. Restarts microshift-etcd.scope to force clean recovery

Test plan

  • Start MicroShift fresh - verify normal startup
  • Stop and restart MicroShift - verify no crash loop
  • If deadlock occurs, verify detection in ~15 seconds
  • Verify "RBAC bootstrap hook deadlock detected" log message on recovery
  • Verify make verify-go passes
  • Verify make test-unit passes

🤖 Generated with Claude Code via /jira:solve [USHIFT-6401](https://redhat.atlassian.net/browse/USHIFT-6401)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Apr 7, 2026
@coderabbitai

coderabbitai Bot commented Apr 7, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a concurrent RBAC bootstrap-hook deadlock detector to kube-apiserver.Run that probes the poststarthook readiness endpoint and etcd health, signals confirmed deadlocks, attempts recovery by stopping microshift-etcd.scope, and causes Run to return a formatted deadlock error.

Changes

Cohort / File(s) Summary
RBAC Deadlock Detection
pkg/controllers/kube-apiserver.go
Converts kubeAPIStartupTimeout to a duration; adds detectRBACHookDeadlock goroutine and rbacDeadlockDetected channel; implements periodic readiness probes against /readyz/poststarthook/rbac/bootstrap-roles, etcd health checks (isEtcdHealthy using etcd v3 Status over TLS), confirmation logic with max checks and wall-clock cap, and recovery via restartMicroshiftEtcdScope (executes systemctl stop microshift-etcd.scope). Adds bounded contexts and logging. (+182/-3).
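
Based on that summary, the etcd health probe could look roughly like the following; this is a hedged sketch with placeholder endpoint and certificate paths, not MicroShift's actual configuration.

package controllers

import (
	"context"
	"time"

	"go.etcd.io/etcd/client/pkg/v3/transport"
	clientv3 "go.etcd.io/etcd/client/v3"
)

// isEtcdHealthy dials etcd over TLS and issues a short Status request.
// The paths and endpoint below are placeholders, not the real values.
func isEtcdHealthy(ctx context.Context) (bool, error) {
	tlsInfo := transport.TLSInfo{
		CertFile:      "/placeholder/client.crt",
		KeyFile:       "/placeholder/client.key",
		TrustedCAFile: "/placeholder/ca.crt",
	}
	tlsConfig, err := tlsInfo.ClientConfig()
	if err != nil {
		return false, err
	}

	const endpoint = "https://127.0.0.1:2379"
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{endpoint},
		DialTimeout: 2 * time.Second,
		TLS:         tlsConfig,
	})
	if err != nil {
		return false, err
	}
	defer cli.Close()

	// Bound the Status call so a wedged etcd cannot stall the detector.
	statusCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()
	if _, err := cli.Status(statusCtx, endpoint); err != nil {
		return false, err
	}
	return true, nil
}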

Sequence Diagram

sequenceDiagram
    participant Run as Run Method
    participant Poll as Readiness Polling
    participant Detector as RBAC Detector
    participant Hook as /readyz/poststarthook/rbac/bootstrap-roles
    participant Etcd as Etcd (v3)
    participant Systemctl as systemctl

    Run->>Poll: start kube-apiserver readiness polling
    Run->>Detector: start RBAC deadlock detector goroutine

    Detector->>Detector: wait initial delay
    loop periodic checks (bounded by confirmation count & max wait)
        Detector->>Hook: GET /readyz/poststarthook/rbac/bootstrap-roles
        Detector->>Etcd: short-timeout Status probe (TLS)
        Etcd-->>Detector: healthy/unhealthy

        alt Etcd healthy & Hook not-ready confirmed
            Detector->>Systemctl: run "systemctl stop microshift-etcd.scope"
            Systemctl-->>Detector: result
            Detector-->>Run: close deadlock channel (signal)
            Run->>Run: log, attempt restart scope, return deadlock error
        else Hook ready or Etcd unhealthy or threshold not reached
            Detector->>Detector: continue polling or reset confirmation
        end
    end

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~40 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 66.67%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (11 passed)

  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title accurately summarizes the main change: adding RBAC bootstrap hook deadlock detection with fail-fast behavior. It references the relevant Jira issues and clearly conveys the primary purpose of the changeset.
  • Linked Issues Check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Stable And Deterministic Test Names ✅ Passed: The PR modifies a controller implementation file, not test files. No Ginkgo test patterns are present, so the check is not applicable.
  • Test Structure And Quality ✅ Passed: The PR contains only production code modifications; there is no Ginkgo test code to assess.
  • Microshift Test Compatibility ✅ Passed: The PR modifies production code only; no Ginkgo e2e tests are introduced, so the MicroShift test compatibility check is not applicable.
  • Single Node Openshift (Sno) Test Compatibility ✅ Passed: This PR does not add any new Ginkgo e2e tests, so the SNO compatibility check is not applicable.
  • Topology-Aware Scheduling Compatibility ✅ Passed: The PR modifies the kube-apiserver controller with deadlock detection and etcd health monitoring. No scheduling constraints, affinity rules, nodeSelector, topology spread constraints, or PodDisruptionBudgets are introduced. The changes are operational and diagnostic, compatible with all OpenShift topologies.
  • Ote Binary Stdout Contract ✅ Passed: The PR contains no non-JSON stdout writes in process-level code. The init() function performs only a TLS profile lookup with no stdout output, and all logging occurs within method bodies outside the OTE contract scope.
  • Ipv6 And Disconnected Network Test Compatibility ✅ Passed: The PR does not add new Ginkgo e2e tests; changes are exclusively in pkg/controllers/kube-apiserver.go, implementing RBAC deadlock detection logic without test files.


@openshift-ci
Contributor

openshift-ci Bot commented Apr 7, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci
Contributor

openshift-ci Bot commented Apr 7, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: copejon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Apr 7, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/controllers/kube-apiserver.go (1)

542-557: Function name is misleading.

restartMicroshiftEtcdScope only stops the scope; the restart occurs when MicroShift itself restarts after Run() returns an error. Consider renaming to stopMicroshiftEtcdScope for clarity.

♻️ Suggested rename
-// restartMicroshiftEtcdScope restarts the microshift-etcd.scope to recover from deadlock.
-// This forces a clean restart of etcd which can help break the circular dependency.
-func restartMicroshiftEtcdScope() error {
+// stopMicroshiftEtcdScope stops the microshift-etcd.scope to recover from deadlock.
+// MicroShift will restart both etcd and kube-apiserver after Run() returns an error.
+func stopMicroshiftEtcdScope() error {
 	klog.Info("Stopping microshift-etcd.scope for recovery")

Also update the call site at line 438.

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/controllers/kube-apiserver.go`:
- Around line 488-496: The current RBAC bootstrap loop resets checkCount to 0
when etcd is unhealthy which allows flapping to prevent ever reaching maxChecks;
modify the logic in the kube-apiserver.go RBAC detection loop (around variables
checkCount, maxChecks, and etcdHealthy) to track a wall-clock deadline (e.g.,
startTime or deadline) when the check begins and do not reset that deadline on
etcd state changes—always compare time.Since(startTime) (or
time.Now().After(deadline)) against a configured maxWaitDuration and fail/exit
the loop if exceeded; keep the existing checkCount logic for deadlock detection
but add this independent timeout check so flapping cannot extend detection
indefinitely.


📥 Commits

Reviewing files that changed from the base of the PR and between 2cc28ca and db029bd.

📒 Files selected for processing (1)
  • pkg/controllers/kube-apiserver.go

@copejon
Contributor Author

copejon commented Apr 7, 2026

@coderabbitai help

@coderabbitai

coderabbitai Bot commented Apr 7, 2026

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

CodeRabbit commands

These commands are invoked using PR/Issue comments.

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai evaluate custom pre-merge check --instructions <custom-checks-instructions> --name <custom-checks-title> [--mode <error|warning>] to validate the custom pre-merge checks instructions. Defaults to error when --mode is omitted.
    • @coderabbitai ignore pre-merge checks to override pre-merge checks and get an approval on PR.
    • @coderabbitai run pre-merge checks to run pre-merge checks on the pull request.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve merge conflict to automatically resolve merge conflicts.
  • @coderabbitai autofix to automatically fix issues identified in unresolved review comments.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai generate configuration to create a PR that adds the current resolved configuration as .coderabbit.yaml (or show it if already present).
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Status, support, documentation and community

  • Visit our status page to check the current availability of CodeRabbit.
  • Create a ticket on our support page for assistance with any issues or questions.
  • Visit our documentation site for detailed information on how to use CodeRabbit.
  • Join our Discord community to connect with other users and get help from the community.
  • Follow us on X/Twitter for updates and announcements.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/controllers/kube-apiserver.go`:
- Around line 469-472: The loop that decides to signal deadlock is incrementing
checkCount and falling through to close(deadlockDetected) on
rbacHookMaxWaitDuration expiry even when isEtcdHealthy() or the RBAC probe
errored; modify the loop (the block around checkCount, isEtcdHealthy,
rbacHookMaxWaitDuration, maxChecks, and close(deadlockDetected)) so that
checkCount is only incremented when the "healthy-etcd stall" predicate is
actually observed (i.e., when both the RBAC probe is unfinished AND
isEtcdHealthy() returns true), ignore/skip increments when probes or health
checks error, and on hitting rbacHookMaxWaitDuration return early unless
checkCount has already reached maxChecks; ensure close(deadlockDetected) is only
called after the predicate has been confirmed maxChecks times.
- Around line 483-486: The RBAC readiness probe call using
restClient.Get().AbsPath("/readyz/poststarthook/rbac/bootstrap-roles").Do(ctx)...
uses the long-lived controller ctx and can hang; wrap that request in a short
per-probe context with timeout (e.g., context.WithTimeout(ctx, 2*time.Second)),
use the derived probeCtx in Do(probeCtx), and defer cancel() so the goroutine
unblocks and the 2s interval / 30s max-wait logic can re-evaluate; update the
call that sets status and err (StatusCode(&status).Error()) to use the probe
context.
- Around line 558-563: The call to stopCmd.CombinedOutput() in
restartMicroshiftEtcdScope is unbounded and can hang; wrap the systemctl
invocation with a context deadline (e.g., create a context.WithTimeout and use
exec.CommandContext) and use that context when constructing stopCmd, then read
CombinedOutput (or CombinedOutput on the context-backed command) and return a
timeout-aware error if the context expires; reference
restartMicroshiftEtcdScope, stopCmd and CombinedOutput and ensure you cancel the
context and propagate/annotate context timeout errors in the returned fmt.Errorf
message.
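
A sketch of the bounded invocation that last comment asks for, assuming a 5-second timeout and the stop-only naming suggested in the earlier nitpick; this is not the merged code.

package controllers

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// stopMicroshiftEtcdScope stops the etcd scope with a hard deadline so the
// recovery path cannot stall on a hung systemctl.
func stopMicroshiftEtcdScope() error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	out, err := exec.CommandContext(ctx, "systemctl", "stop", "microshift-etcd.scope").CombinedOutput()
	if ctx.Err() == context.DeadlineExceeded {
		return fmt.Errorf("systemctl stop microshift-etcd.scope timed out: %s", string(out))
	}
	if err != nil {
		return fmt.Errorf("systemctl stop microshift-etcd.scope failed: %w (output: %s)", err, string(out))
	}
	return nil
}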

📥 Commits

Reviewing files that changed from the base of the PR and between 2cc28ca and 6de022e.

📒 Files selected for processing (1)
  • pkg/controllers/kube-apiserver.go

@copejon copejon changed the title USHIFT-6401: Add fail-fast RBAC bootstrap hook deadlock detection USHIFT-6401, USHIFT-6788: Add fail-fast RBAC bootstrap hook deadlock detection Apr 8, 2026
@openshift-ci-robot

openshift-ci-robot commented Apr 8, 2026

@copejon: This pull request references USHIFT-6401 which is a valid jira issue.

This pull request references USHIFT-6788 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.22.0" version, but no target version was set.


@copejon copejon marked this pull request as ready for review April 16, 2026 14:12
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Apr 16, 2026
@openshift-ci openshift-ci Bot requested review from eslutsky and pacevedom April 16, 2026 14:13
@openshift-ci-robot

openshift-ci-robot commented Apr 16, 2026

@copejon: This pull request references USHIFT-6401 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "openshift-4.22" instead.

This pull request references USHIFT-6788 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set.



@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
pkg/controllers/kube-apiserver.go (1)

468-515: ⚠️ Potential issue | 🟠 Major

Wall-clock deadline break still triggers false deadlock signal.

When rbacHookMaxWaitDuration expires at line 472, the loop breaks and falls through to close(deadlockDetected) at line 515—even if etcd was never confirmed healthy. The "etcd healthy + hook stuck" predicate may never have been observed.

Additionally, checkCount++ at line 481 executes before conditions are verified. If isEtcdHealthy errors, the count advances without observing the deadlock predicate.

Proposed fix
+	deadlockPredicateObserved := false
+
 	for checkCount < maxChecks {
 		// Check absolute deadline first - this cannot be reset by etcd state changes
 		if time.Since(startTime) >= rbacHookMaxWaitDuration {
 			klog.Errorf("RBAC bootstrap hook exceeded maximum wait duration of %v", rbacHookMaxWaitDuration)
-			break
+			if !deadlockPredicateObserved {
+				klog.Warning("Wall-clock deadline reached but deadlock predicate never confirmed - not signaling deadlock")
+				return
+			}
+			break // predicate was observed, signal deadlock
 		}
 
 		select {
 		case <-ctx.Done():
 			return
 		case <-time.After(rbacHookCheckInterval * time.Second):
 		}
 
-		checkCount++
-
 		// Check RBAC hook status
 		var status int
 		err := restClient.Get().AbsPath("/readyz/poststarthook/rbac/bootstrap-roles").Do(ctx).StatusCode(&status).Error()
 
 		// If hook is ready, no deadlock
 		if err == nil && status == 200 {
 			klog.V(4).Info("RBAC bootstrap hook completed successfully")
 			return
 		}
 
 		// Hook not ready - check if etcd is healthy
 		etcdHealthy, etcdErr := isEtcdHealthy(ctx)
 		if etcdErr != nil {
 			klog.V(4).Infof("Could not check etcd health: %v", etcdErr)
 			continue
 		}
 
 		if etcdHealthy {
+			checkCount++
+			deadlockPredicateObserved = true
 			klog.Warningf("RBAC bootstrap hook not ready (check %d/%d, elapsed %v), but etcd is healthy - potential deadlock",
 				checkCount, maxChecks, time.Since(startTime).Round(time.Second))
 		} else {
 			// etcd not healthy - not a deadlock, just waiting for etcd
 			klog.V(4).Infof("RBAC hook waiting, etcd not yet healthy (check %d/%d)", checkCount, maxChecks)
-			// Reset counter since this isn't a deadlock condition
-			// Note: wall-clock deadline (startTime) is NOT reset - flapping cannot extend indefinitely
 			checkCount = 0
 		}
 	}
🧹 Nitpick comments (1)
pkg/controllers/kube-apiserver.go (1)

566-568: Consider context-aware wait.

The 1s time.Sleep isn't interruptible. Low impact but a select with ctx.Done() would be cleaner for graceful shutdown.
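
A small sketch of the interruptible wait being suggested; the helper name is hypothetical.

package controllers

import (
	"context"
	"time"
)

// sleepOrCancel waits for d, or returns early when ctx is cancelled.
func sleepOrCancel(ctx context.Context, d time.Duration) error {
	select {
	case <-ctx.Done():
		return ctx.Err() // graceful shutdown: stop waiting immediately
	case <-time.After(d):
		return nil
	}
}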


📥 Commits

Reviewing files that changed from the base of the PR and between 2cc28ca and 6de022e.

📒 Files selected for processing (1)
  • pkg/controllers/kube-apiserver.go

@copejon
Contributor Author

copejon commented Apr 16, 2026

/retest

@openshift-ci-robot

openshift-ci-robot commented Apr 17, 2026

@copejon: This pull request references USHIFT-6401 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "openshift-4.22" instead.

This pull request references USHIFT-6788 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/controllers/kube-apiserver.go`:
- Around line 59-69: rbac timing constants rbacHookDeadlockTimeout and
rbacHookCheckInterval are plain integers but used as time.Duration; change them
to durations by multiplying by time.Second (rbacHookDeadlockTimeout = 15 *
time.Second and rbacHookCheckInterval = 2 * time.Second), then update the
log/format that prints the timeout to use %s (duration) instead of “%d seconds”,
and remove any places that multiply rbacHookCheckInterval by time.Second (e.g.
the spot that computes maxChecks / or multiplies the interval) so all arithmetic
uses time.Duration consistently.
- Around line 496-515: The current RBAC probe logic treats any non-nil err from
restClient.Get() as a transport error and skips deadlock counting; change the
condition so only requests with no HTTP status code (transport/timeout errors)
are skipped — i.e., update the check around restClient.Get()/probeCtx to
skip/continue only when err != nil AND status == 0 (use the existing status
integer and err from the Do(...).StatusCode(&status).Error() call), while
allowing HTTP responses (even 5xx where status != 0) to be counted toward the
deadlock check and preserve the existing klog message for the skipped case.
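
For reference, a sketch of how the constants would read once they are true durations, with values taken from this thread; call sites then consume them directly instead of multiplying by time.Second.

const (
	rbacHookDeadlockTimeout = 15 * time.Second // was: 15 (int, implicitly "seconds")
	rbacHookCheckInterval   = 2 * time.Second  // was: 2
	rbacHookMaxWaitDuration = 30 * time.Second
)

// Usage passes the durations as-is, e.g.:
//   case <-time.After(rbacHookCheckInterval):
// and logging formats them with %s/%v rather than "%d seconds":
//   klog.Errorf("RBAC bootstrap hook deadlock detected after %s", rbacHookDeadlockTimeout)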

📥 Commits

Reviewing files that changed from the base of the PR and between 7daee5d and 184e1cb.

📒 Files selected for processing (1)
  • pkg/controllers/kube-apiserver.go

copejon and others added 3 commits April 22, 2026 10:20
…tion

On MicroShift restart, the RBAC bootstrap hook can deadlock when etcd
contains existing data. The hook uses context.TODO() for API calls,
which has no timeout. When the loopback client hangs, this creates a
circular dependency where the hook waits for the API server while the
API server waits for the hook to complete.

This change adds a parallel deadlock detector that:
- Monitors /readyz/poststarthook/rbac/bootstrap-roles specifically
- Checks if etcd is healthy while the hook is stuck
- Detects deadlock in ~15 seconds instead of waiting 60 seconds
- Restarts microshift-etcd.scope to recover from the deadlock

This breaks the crash loop by detecting the condition early and taking
recovery action at the MicroShift level, without requiring changes to
vendored upstream Kubernetes code.

Related upstream issues: kubernetes/kubernetes#86715, kubernetes/kubernetes#97119

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add rbacHookMaxWaitDuration (30s) as an absolute deadline that cannot
be reset by etcd health state changes. This prevents a flapping etcd
from extending the deadlock detection indefinitely.

The existing checkCount logic is preserved for detecting deadlock when
etcd is consistently healthy, but the wall-clock deadline provides a
hard upper bound regardless of etcd state transitions.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…itives

- Only increment checkCount when deadlock predicate confirmed
  (RBAC not ready AND etcd healthy)
- Skip counting when RBAC probe or etcd health check errors
- On wall-clock timeout, only trigger recovery if checkCount >= maxChecks
- Add 1s timeout to RBAC probe to prevent hanging on unresponsive API
- Add 5s timeout to systemctl stop to prevent recovery path from stalling
- Extract rbacHookPollDelayStart constant for clarity

Prevents false positive deadlock detection when etcd flaps or probes
error, ensuring close(deadlockDetected) only fires after confirming
the deadlock condition the required number of times.

Related: kubernetes/kubernetes#86715, kubernetes/kubernetes#97119

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@copejon copejon force-pushed the fix-USHIFT-6401 branch 2 times, most recently from b64ecdb to 097404a on April 22, 2026 16:50

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
pkg/controllers/kube-apiserver.go (2)

371-371: ⚠️ Potential issue | 🔴 Critical

Use duration constants directly.

Line 371 now overflows because kubeAPIStartupTimeout is already a time.Duration. Line 493 also turns the 2s detector interval into a multi-decade sleep. Use the constants directly and format durations with %s.

Proposed fix
-		err := wait.PollUntilContextTimeout(ctx, time.Second, kubeAPIStartupTimeout*time.Second, true, func(ctx context.Context) (bool, error) {
+		err := wait.PollUntilContextTimeout(ctx, time.Second, kubeAPIStartupTimeout, true, func(ctx context.Context) (bool, error) {
...
-		return fmt.Errorf("RBAC bootstrap hook deadlock detected after %d seconds", rbacHookDeadlockTimeout)
+		return fmt.Errorf("RBAC bootstrap hook deadlock detected after %s", rbacHookDeadlockTimeout)
...
-	case <-time.After(rbacHookCheckInterval * time.Second):
+	case <-time.After(rbacHookCheckInterval):

Verify the stale multipliers are gone:

#!/bin/bash
rg -n 'kubeAPIStartupTimeout\s*\*\s*time\.Second|rbacHookCheckInterval\s*\*\s*time\.Second|after %d seconds' pkg/controllers/kube-apiserver.go

Also applies to: 446-446, 490-494


496-515: ⚠️ Potential issue | 🟠 Major

Don’t skip HTTP not-ready responses.

client-go can return both status != 0 and err != nil for HTTP error responses, so Line 513 skips the RBAC “not ready” signal the detector needs. Only skip transport/timeout failures where no HTTP status was received.

Proposed fix
-		if err != nil {
+		if err != nil && status == 0 {
 			klog.V(4).Infof("RBAC probe error (not counting toward deadlock): %v", err)
 			continue
 		}

Verify client-go Result status/error behavior in the vendored source:

#!/bin/bash
sed -n '496,515p' pkg/controllers/kube-apiserver.go
rg -n -A8 -B4 'func \(r .*Result\) StatusCode|func \(r .*Result\) Error|func \(r \*Request\) transformResponse' ./deps/github.com/openshift/kubernetes/staging/src/k8s.io/client-go/rest/request.go
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/controllers/kube-apiserver.go`:
- Line 70: The const block in kube-apiserver.go contains a stray character "1"
immediately before its closing parenthesis which breaks Go syntax; remove that
stray "1" so the const (...) block ends with just the closing ')' and nothing
else, ensuring the const declaration (the const block surrounding the
kube-apiserver constants) compiles cleanly.


📥 Commits

Reviewing files that changed from the base of the PR and between 21b06c5 and 097404a.

📒 Files selected for processing (1)
  • pkg/controllers/kube-apiserver.go

@openshift-ci
Contributor

openshift-ci Bot commented Apr 22, 2026

@copejon: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@copejon
Contributor Author

copejon commented Apr 22, 2026

/verified by ci

@openshift-ci-robot openshift-ci-robot added the verified label (signifies that the PR passed pre-merge verification criteria) on Apr 22, 2026
@openshift-ci-robot

@copejon: This PR has been marked as verified by ci.



Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • verified: Signifies that the PR passed pre-merge verification criteria.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants